library(dplyr)
library(Seurat)
library(patchwork)
library(ggplot2)
In this vignette, we are analysing a dataset of …. available at …. There are [number of cells] single cells that were sequenced on the [Illumina NextSeq 500]. The raw data can be found here.
We start by reading in the data:
Read10X() returns a UMI count matrix.
The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column).
CreateSeuratObject uses the count matrix to create a Seurat object.
The object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset.
A note on Seurat objects:
Full description here. Seurat objects contain an Assay class, with slots for raw counts (@counts slot), normalized data (@data slot), and scaled data for dimensional reduction (@scale.data slot), and DimReduc class. You can access and add data using GetAssayData, SetAssayData, [['']] and $.
Note: You may get warnings about suppressed column names and feature names cannot have underscores (‘_’), replacing with dashes (‘-’).
# continue from here, find where data is Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "/Users/HollyGiles/Documents/Projects/seurat/data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialise the Seurat object with the raw (non-normalised data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc
## An object of class Seurat
## 13714 features across 2700 samples within 1 assay
## Active assay: RNA (13714 features, 0 variable features)
# Lets examine a few genes in the first 30 cells
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
## 3 x 30 sparse Matrix of class "dgCMatrix"
##
## CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
## TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
## MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
The . values in the matrix represent 0s (no molecules detected). Since most values in an scRNA-seq matrix are 0, Seurat uses a sparse-matrix representation whenever possible. This results in significant memory and speed savings for Drop-seq/inDrop/10x data.
The steps below encompass the pre-processing workflow for scRNA-seq data in Seurat. These represent the selection and filtration of cells based on QC metrics, data normalisation and scaling, and the detection of highly variable features.
Here we explore QC metrics and filter cells based on defined criteria.
The QC metrics are as follows:
Check the metadata object to get the number of unique genes and total molecules (calculated during CreateSeuratObject()).
# Show QC metrics for the first 5 cells
head(pbmc@meta.data, 5)
## orig.ident nCount_RNA nFeature_RNA
## AAACATACAACCAC-1 pbmc3k 2419 779
## AAACATTGAGCTAC-1 pbmc3k 4903 1352
## AAACATTGATCAGC-1 pbmc3k 3147 1129
## AAACCGTGCTTCCG-1 pbmc3k 2639 960
## AAACCGTGTATGCG-1 pbmc3k 980 521
Calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features
Use the set of all genes starting with MT- as a set of mitochondrial genes
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Visualise QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# FeatureScatter is typically used to visualise feature-feature relationships, but can be used
# for anything calculated by the object, i.e. columns in object metadata, PC scores etc.
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
# filter based on QC metrics
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
We employ a global-scaling normalisation method “LogNormalize” that normalises the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalised values are stored in pbmc[["RNA"]]@data.
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). Focusing on these genes in downstream analysis helps to highlight biological signal in single-cell datasets.
The procedure in Seurat is described in detail here, and is implemented in the FindVariableFeatures() function. By default, the function returns 2,000 features per dataset. These will be used in downstream analysis, like PCA.
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:
pbmc[["RNA"]]@scale.dataScale for all genes:
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
Faster option: scale only for previously identified variable features
(Scaling is an essential step in the Seurat workflow, but only on genes that will be used as input to PCA. NB PCA and clustering results are unaffected by only scaling variable genes. However, Seurat heatmaps require genes to be scaled, to make sure highly-expressed genes don’t dominate the heatmap).
pbmc <- ScaleData(pbmc)
Use the ScaleData() function to remove unwanted sources of variation from a single-cell dataset and ‘regress out’ heterogeneity associated with e.g. cell cycle stage, or mitochondrial contamination.
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
In case I need to run more sophisticated normalisation:
For advanced users who would like to use this functionality, we strongly recommend the use of our new normalisation workflow, SCTransform(). The method is described in this paper, with a separate vignette using Seurat v3 here.
By default, only the previously determined variable features are used as input, but can be defined using features argument.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
Seurat provides several useful ways of visualising both cells and features that define the PCA, including VizDimReduction(), DimPlot(), and DimHeatmap()
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: CST3, TYROBP, LST1, AIF1, FTL
## Negative: MALAT1, LTB, IL32, IL7R, CD2
## PC_ 2
## Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
## Negative: NKG7, PRF1, CST7, GZMB, GZMA
## PC_ 3
## Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
## Negative: PPBP, PF4, SDPR, SPARC, GNG11
## PC_ 4
## Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
## Negative: VIM, IL7R, S100A6, IL32, S100A8
## PC_ 5
## Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
## Negative: LTB, IL7R, CKB, VIM, MS4A7
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
DimPlot(pbmc, reduction = "pca")
DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Plotting a subset of cells by setting cells to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though this a supervised analysis, it remains a valuable tool for exploring correlated feature sets.
Visualise PC1
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
Visualise PC1 - PC15
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
To overcome the extensive technical noise in any single feature for scRNA-seq data, Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set. The top principal components therefore represent a robust compression of the dataset.
Three approaches to consider when selecting the number of PCs:
Supervised: exploring PCs to determine relevant sources of heterogeneity (as above). This could be used in conjunction with GSEA.
Implement a statistical test based on a random null model (a little more time-consuming for large datasets, and may not return a clear PC cutoff).
Heuristic method with ElbowPlot . Commonly used, and can be calculated instantly.
In the example here, all three approaches yielded similar results, and you could be justified in choosing anything between PC 7-12 as a cutoff.
Method 2:
Macosko et al, implements a resampling test inspired by the JackStraw procedure. This method randomly permutes a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of feature scores, and repeat this procedure. This identifies ‘significant’ PCs as those who have a strong enrichment of low p-value features.
# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
The JackStrawPlot() function provides a visualisation tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of features with low p-values (solid curve above the dashed line). In this case it appears that there is a sharp drop-off in significance after the first 10-12 PCs.
JackStrawPlot(pbmc, dims = 1:15)
Method 3:
An alternative heuristic method generates an ‘Elbow plot’: a ranking of principle components based on the percentage of variance explained by each one (ElbowPlot() function). In this example, we observe an ‘elbow’ around PC9-10, suggesting that the majority of true signal is captured in the first 10 PCs.
ElbowPlot(pbmc)
Factors to consider, in addition to the above:
Seurat v3 applies a graph-based clustering approach, building upon initial strategies in (Macosko et al). Importantly, the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same. However, our approach to partitioning the cellular distance matrix into clusters has dramatically improved. Our approach was heavily inspired by recent manuscripts which applied graph-based clustering approaches to scRNA-seq data [SNN-Cliq, Xu and Su, Bioinformatics, 2015] and CyTOF data [PhenoGraph, Levine et al., Cell, 2015]. Briefly, these methods embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.
As in PhenoGraph, a KNN graph is first constructed based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). This step is performed using the FindNeighbors() function, and takes as input the previously defined dimensionality of the dataset (first 10 PCs).
To cluster the cells, modularity optimisation techniques are next applied (such as the Louvain algorithm (default) or SLM [SLM, Blondel et al., Journal of Statistical Mechanics]), to iteratively group cells together, with the goal of optimizing the standard modularity function. The FindClusters() function implements this procedure, and contains a resolution parameter that sets the ‘granularity’ of the downstream clustering, with increased values leading to a greater number of clusters. Setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. The clusters can be found using the Idents() function.
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 2638
## Number of edges: 95965
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8723
## Number of communities: 9
## Elapsed time: 0 seconds
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
## 2 3 2 1
## AAACCGTGTATGCG-1
## 6
## Levels: 0 1 2 3 4 5 6 7 8
The goal of tSNE and UMAP algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space. NB Cells within the graph-based clusters determined above should co-localise on these dimension reduction plots. As input to the UMAP and tSNE, use the same PCs as input to the clustering analysis.
# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)
# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")
Here we find markers that define each of the clusters via differential expression. FindAllMarkers() identifies positive and negative markers of a single cluster (specified by ident.1), compared to all other cells. You can also test groups of clusters vs. each other, or against all cells. FindMarkers() is for more specific comparisons.
Arguments:
The min.pct argument requires a feature to be detected at a minimum percentage in either of the two groups of cells
The thresh.test argument requires a feature to be differentially expressed (on average) by some amount between the two groups.
Its possible to set both to 0, but with a dramatic increase in time since this will test a large number of features that are unlikely to be highly discriminatory.
To speed up these computations, max.cells.per.ident can be set. This will downsample each identity class to have no more cells than whatever this is set to. While there is generally going to be a loss in power, the speed increases can be significant and the most highly differentially expressed features will likely still rise to the top.
# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## IL32 2.593535e-91 1.2154360 0.949 0.466 3.556774e-87
## LTB 7.994465e-87 1.2828597 0.981 0.644 1.096361e-82
## CD3D 3.922451e-70 0.9359210 0.922 0.433 5.379250e-66
## IL7R 1.130870e-66 1.1776027 0.748 0.327 1.550876e-62
## LDHB 4.082189e-65 0.8837324 0.953 0.614 5.598314e-61
# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## FCGR3A 2.150929e-209 4.267579 0.975 0.039 2.949784e-205
## IFITM3 6.103366e-199 3.877105 0.975 0.048 8.370156e-195
## CFD 8.891428e-198 3.411039 0.938 0.037 1.219370e-193
## CD68 2.374425e-194 3.014535 0.926 0.035 3.256286e-190
## RP11-290F20.3 9.308287e-191 2.722684 0.840 0.016 1.276538e-186
# find markers for every cluster compared to all remaining cells, report only the positive
# ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
# show the top two maerks in each cluster
pbmc.markers %>%
group_by(cluster) %>%
slice_max(n = 2, order_by = avg_log2FC)
## # A tibble: 18 × 7
## # Groups: cluster [9]
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.17e- 83 1.33 0.435 0.108 1.60e- 79 0 CCR7
## 2 1.74e-109 1.07 0.897 0.593 2.39e-105 0 LDHB
## 3 0 5.57 0.996 0.215 0 1 S100A9
## 4 0 5.48 0.975 0.121 0 1 S100A8
## 5 7.99e- 87 1.28 0.981 0.644 1.10e- 82 2 LTB
## 6 2.61e- 59 1.24 0.424 0.111 3.58e- 55 2 AQP3
## 7 0 4.31 0.936 0.041 0 3 CD79A
## 8 9.48e-271 3.59 0.622 0.022 1.30e-266 3 TCL1A
## 9 4.93e-169 3.01 0.595 0.056 6.76e-165 4 GZMK
## 10 1.17e-178 2.97 0.957 0.241 1.60e-174 4 CCL5
## 11 3.51e-184 3.31 0.975 0.134 4.82e-180 5 FCGR3A
## 12 2.03e-125 3.09 1 0.315 2.78e-121 5 LST1
## 13 6.82e-175 4.92 0.958 0.135 9.36e-171 6 GNLY
## 14 1.05e-265 4.89 0.986 0.071 1.44e-261 6 GZMB
## 15 1.48e-220 3.87 0.812 0.011 2.03e-216 7 FCER1A
## 16 1.67e- 21 2.87 1 0.513 2.28e- 17 7 HLA-DPB1
## 17 3.68e-110 8.58 1 0.024 5.05e-106 8 PPBP
## 18 7.73e-200 7.24 1 0.01 1.06e-195 8 PF4
Seurat has several tests for differential expression which can be set with the test.use parameter (see our DE vignette for details). For example, the ROC test returns the ‘classification power’ for any individual marker (ranging from 0 - random, to 1 - perfect).
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
There are several tools for visualising marker expression:
VlnPlot() shows expression probability distributions across clusters
FeaturePlot() visualises feature expression on a tSNE or PCA plot
Also:
RidgePlot()
CellScatter()
DotPlot()
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
DoHeatmap() generates an expression heatmap for given cells and features. Here we plot the top 20 markers (or all markers if less than 20) for each cluster.
pbmc.markers %>%
group_by(cluster) %>%
top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
Use canonical markers to easily match the unbiased clustering to known cell types.
| Cluster ID | Markers | Cell Type |
|---|---|---|
| 0 | IL7R, CCR7 | Naive CD4+ T |
| 1 | CD14, LYZ | CD14+ Mono |
| 2 | IL7R, S100A4 | Memory CD4+ |
| 3 | MS4A1 | B |
| 4 | CD8A | CD8+ T |
| 5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
| 6 | GNLY, NKG7 | NK |
| 7 | FCER1A, CST3 | DC |
| 8 | PPBP | Platelet |
# define cluster IDs
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
# saveRDS(pbmc, file = '../data/seuratObject.rds')
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.6 patchwork_1.1.2 sp_1.5-0 SeuratObject_4.1.2
## [5] Seurat_4.2.0 dplyr_1.0.10 BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.16 colorspace_2.0-3 deldir_1.0-6
## [4] ellipsis_0.3.2 ggridges_0.5.4 spatstat.data_2.2-0
## [7] rstudioapi_0.14 farver_2.1.1 leiden_0.4.3
## [10] listenv_0.8.0 ggrepel_0.9.1 fansi_1.0.3
## [13] codetools_0.2-18 splines_4.2.1 cachem_1.0.6
## [16] knitr_1.40 polyclip_1.10-0 jsonlite_1.8.0
## [19] ica_1.0-3 cluster_2.1.4 png_0.1-7
## [22] rgeos_0.5-9 uwot_0.1.14 spatstat.sparse_2.1-1
## [25] shiny_1.7.2 sctransform_0.3.5 BiocManager_1.30.18
## [28] compiler_4.2.1 httr_1.4.4 assertthat_0.2.1
## [31] Matrix_1.5-1 fastmap_1.1.0 lazyeval_0.2.2
## [34] cli_3.4.1 later_1.3.0 formatR_1.12
## [37] htmltools_0.5.3 tools_4.2.1 igraph_1.3.5
## [40] gtable_0.3.1 glue_1.6.2 RANN_2.6.1
## [43] reshape2_1.4.4 Rcpp_1.0.9 scattermore_0.8
## [46] jquerylib_0.1.4 vctrs_0.4.1 nlme_3.1-159
## [49] progressr_0.11.0 lmtest_0.9-40 spatstat.random_2.2-0
## [52] xfun_0.33 stringr_1.4.1 globals_0.16.1
## [55] mime_0.12 miniUI_0.1.1.1 lifecycle_1.0.2
## [58] irlba_2.3.5 goftest_1.2-3 future_1.28.0
## [61] MASS_7.3-58.1 zoo_1.8-11 scales_1.2.1
## [64] spatstat.core_2.4-4 promises_1.2.0.1 spatstat.utils_2.3-1
## [67] parallel_4.2.1 RColorBrewer_1.1-3 yaml_2.3.5
## [70] reticulate_1.26 pbapply_1.5-0 gridExtra_2.3
## [73] sass_0.4.2 rpart_4.1.16 stringi_1.7.8
## [76] highr_0.9 rlang_1.0.6 pkgconfig_2.0.3
## [79] matrixStats_0.62.0 evaluate_0.16 lattice_0.20-45
## [82] tensor_1.5 ROCR_1.0-11 purrr_0.3.4
## [85] labeling_0.4.2 htmlwidgets_1.5.4 cowplot_1.1.1
## [88] tidyselect_1.1.2 parallelly_1.32.1 RcppAnnoy_0.0.19
## [91] plyr_1.8.7 magrittr_2.0.3 bookdown_0.29
## [94] R6_2.5.1 magick_2.7.3 generics_0.1.3
## [97] DBI_1.1.3 withr_2.5.0 mgcv_1.8-40
## [100] pillar_1.8.1 fitdistrplus_1.1-8 abind_1.4-5
## [103] survival_3.4-0 tibble_3.1.8 future.apply_1.9.1
## [106] crayon_1.5.1 KernSmooth_2.23-20 utf8_1.2.2
## [109] spatstat.geom_2.4-0 plotly_4.10.0 rmarkdown_2.16
## [112] grid_4.2.1 data.table_1.14.2 digest_0.6.29
## [115] xtable_1.8-4 tidyr_1.2.1 httpuv_1.6.6
## [118] munsell_0.5.0 viridisLite_0.4.1 bslib_0.4.0